Note: This is older data and the meaningful identifiers have been removed from the compounds (Mass, LC Retention Time). All compounds are only referenced by an arbitrary number (although the compounds were sorted by mass from smallest to largest before being given a number).

Background:

The Motivation

This is a mass spectrometry metabolomics experiment based on a project by a post-doc in my lab, who is in the process of creating a database of oxidation-senstive molecules and the products of their oxidation.

The motivation behind this project is that there is evidence that a variety of diseases may be caused by oxidative stress. Oxidative stress occurs when there is a large quantity of oxidizing molecules present in a cell, tissue, or organism, such that cannot be neautralized by the organism’s antioxidant defenses, which come in a variety of forms and can be produced by the organism itself or obtained from the diet. When oxidative stress is too high, the oxidizing molecules go rogue, chemically speaking, and can damage DNA, proteins, and other molecules within the cell. This type of cellular damage has been linked to diseases such as cancer and heart disease. Therefore, in order to understand and treat these diseases properly, we need to understand the nature of the oxidative stress. However, we have a limited understanding of all of the molecules that are oxidation-senstive, and what molecules might be products of oxidative stress.

The Project

To expand our knowledge about oxidative stress, a project has been initiated in my lab to create a “redoxome” database, which would act as a database of redox-related molecules, and that database should be ideally specific to the biological fluid or tissue of interest. The basic idea is to generate a database of “real” oxidation-sensitive molecules and the molecules produced by oxidative stress by taking a normal sample of the biological material that one is interested in studying, and subjecting it to one or more oxidizing potentials in a chemical oxidation cell (an artificially oxidizing environment).

In the case of this specific experient, there is a base/control “0mV” fraction, and 4 oxidized fractions, “250mV”, “500mV”, “750mV”, and “1000mV”. The fractions were run on our LC/MS system to quantify the abundances of the various molecules present in each sample. The features were then extracted in an “untargeted” way - as in it’s possible to extract features based only on compound mass and retention time on the LC system, without knowing their exact identity.

Generally, it’s a challenge to identify compounds based only on mass and retention time alone, especially as mass increases, because the potential chemical space increases and the ways that the atoms that make up that molecule can be connected increases. One way to identify compounds is to use a database, most labs will have their own database of compounds that they’ve run on their own system (while mass does not vary, retention time will vary with chromatography, and the separation of various isoforms will vary), or labs will adopat a chromatography system matched to a particular database. Identifying compounds based on mass is the next best thing, but it’s never a sure thing.

Unknowns can be subjected to what is known as MS/MS for identification - certain MS instruments can be used to subject samples to two rounds of MS. The first MS step is to select a peak or feature of interest, the second MS step is to fragment the compound. MS/MS is done on each compound of interest one by one to generate a clean fragment spectrum for each compound. The compound structure can then be guessed at because each molecule will break up in particular ways depending on which bonds are the most susceptible to breaking. Even when a compound structure is proposed, a standard is usually bought or synthesized (usually bought, because the standard has to be MS-grade purity with no potential confounding compounds), and the standard is run on the LC/MS system to compare retention time and MS/MS is performed to make sure the fragmentation patterns of the unknown and the standard are a match.

Hopefully it’s becoming clear that identifying a potentially interesting unknown is a lot of work and potententially expensive. Therefore, it’s extremely important to select well!

The Goals

  1. Find potentially interesting biomarkers of oxidative stress.
  2. Create a model that could be used to estimate the oxidative stress in “Potential (mV)” units in a real biological sample, given only the compounds present in that sample and their abundances.
# packages
require(heatmaply)
require(magrittr)
require(tidyverse)
require(cowplot)

Another note that I should make, before getting too far, is that a mass spec can only detect charged molecules (as in a molecule with one or more total positive or negative charges). A MS instrument cannot measure an uncharged netural molecule and there is a number of ways to 1) give molecules a charge and 2) to detect the charged molcule. Also, there are two detection modes:

  • Negative, which is better for picking up the signals of molecules that have a tendency to accept a negative charge
  • Positive, in which the instrument will only pick up signals from positively charged molecules, and is therefore better for quantifying the abundances of molecules that tend to pick up positive charges

Some molecules can be picked up in both modes, others will be detected exclusively in one, but not the other. Depending on whether the quantification is relative or absolute (using a standard curve), the values for the same molecule in the two modes may not agree because molecules tend to be better at picking up one type of charge over another.

The set of samples considered in this analysis was analyzed in both negative and positive mode, and both datasets are analyzed here. n._ will denote negative mode data, whereas p._ will denote positive mode data.

# data
n.data <- read_csv(file = "QTOF_precolumn_redox_neg_untarget_nomassRT.csv", col_names = TRUE)
p.data <- read_csv(file = "QTOF_precolumn_redox_pos_untarget_nomassRT.csv", col_names = TRUE)  

ncol(n.data)
## [1] 948
ncol(p.data)
## [1] 476

There are 946 Compounds (Features) associated with the negative mode samples (948 columns in the n.data because 1 column is sample names and the other contains the oxidizing potential measurement). There are 474 Compounds associated with the postive mode data. Important note: even though the same compound may be detected in both and negative mode, this is not always the case, and the nC1 is not necessarily the same as pC1 (and so on for the other numbered compounds).

Another quick note: n.X stands for negative mode data and p.X stands for positive mode data.

What the data looks like:

n.data
## # A tibble: 25 x 948
##            Samples Potential      nC1      nC2      nC3      nC4      nC5
##              <chr>     <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1   n_mouse_0mV_1         0 72636.50 30439.83 26904.16 32375.49 48462.08
##  2   n_mouse_0mV_2         0 65962.64 29955.17 22888.43 34969.71 52285.44
##  3   n_mouse_0mV_3         0 61630.91 26862.67 19755.00 35382.19 44199.13
##  4   n_mouse_0mV_4         0 73057.04 28769.64 20462.82 32864.51 43231.48
##  5   n_mouse_0mV_5         0 70198.90 28441.28 23492.23 34235.84 42638.38
##  6 n_mouse_250mV_1       250 78256.57 31493.36 22720.86 37443.95 60448.74
##  7 n_mouse_250mV_2       250 75873.97 29715.76 26594.17 36876.72 69686.89
##  8 n_mouse_250mV_3       250 79573.65 30031.75 21298.50 36288.14 59219.87
##  9 n_mouse_250mV_4       250 77711.24 32062.44 19113.70 39309.13 56124.11
## 10 n_mouse_250mV_5       250 80901.66 28178.00 24193.07 37332.25 57955.01
## # ... with 15 more rows, and 941 more variables: nC6 <dbl>, nC7 <dbl>,
## #   nC8 <dbl>, nC9 <dbl>, nC10 <dbl>, nC11 <dbl>, nC12 <dbl>, nC13 <dbl>,
## #   nC14 <dbl>, nC15 <dbl>, nC16 <dbl>, nC17 <dbl>, nC18 <dbl>,
## #   nC19 <dbl>, nC20 <dbl>, nC21 <dbl>, nC22 <dbl>, nC23 <dbl>,
## #   nC24 <dbl>, nC25 <dbl>, nC26 <dbl>, nC27 <dbl>, nC28 <dbl>,
## #   nC29 <dbl>, nC30 <dbl>, nC31 <dbl>, nC32 <dbl>, nC33 <dbl>,
## #   nC34 <dbl>, nC35 <dbl>, nC36 <dbl>, nC37 <dbl>, nC38 <dbl>,
## #   nC39 <dbl>, nC40 <dbl>, nC41 <dbl>, nC42 <dbl>, nC43 <dbl>,
## #   nC44 <dbl>, nC45 <dbl>, nC46 <dbl>, nC47 <dbl>, nC48 <dbl>,
## #   nC49 <dbl>, nC50 <dbl>, nC51 <dbl>, nC52 <dbl>, nC53 <dbl>,
## #   nC54 <dbl>, nC55 <dbl>, nC56 <dbl>, nC57 <dbl>, nC58 <dbl>,
## #   nC59 <dbl>, nC60 <dbl>, nC61 <dbl>, nC62 <dbl>, nC63 <dbl>,
## #   nC64 <dbl>, nC65 <dbl>, nC66 <dbl>, nC67 <dbl>, nC68 <dbl>,
## #   nC69 <dbl>, nC70 <dbl>, nC71 <dbl>, nC72 <dbl>, nC73 <dbl>,
## #   nC74 <dbl>, nC75 <dbl>, nC76 <dbl>, nC77 <dbl>, nC78 <dbl>,
## #   nC79 <dbl>, nC80 <dbl>, nC81 <dbl>, nC82 <dbl>, nC83 <dbl>,
## #   nC84 <dbl>, nC85 <dbl>, nC86 <dbl>, nC87 <dbl>, nC88 <dbl>,
## #   nC89 <dbl>, nC90 <dbl>, nC91 <dbl>, nC92 <dbl>, nC93 <dbl>,
## #   nC94 <dbl>, nC95 <dbl>, nC96 <dbl>, nC97 <dbl>, nC98 <dbl>,
## #   nC99 <dbl>, nC100 <dbl>, nC101 <dbl>, nC102 <dbl>, nC103 <dbl>,
## #   nC104 <dbl>, nC105 <dbl>, ...
summary(n.data$Potential)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     250     500     500     750    1000

There are 5 samples per each oxidizing potential.

n.data$Potential <- factor(n.data$Potential, 
                           levels = c(0, 250, 500, 750, 1000), 
                           labels = c("0", "250", "500", "750", "1000"))
p.data$Potential <- factor(p.data$Potential, 
                           levels = c(0, 250, 500, 750, 1000), 
                           labels = c("0", "250", "500", "750", "1000"))

Metabolomics data tend to have a right-skewed distribution and it’s customary to log-transform the data. PCA is senstive to extreme values and so here, I will log-transform as well before the PCA analysis.

n.data.log <- bind_cols(
  select(n.data, Samples:Potential),
  n.data %>% select(nC1:nC946) %>% log()
  )
p.data.log <- bind_cols(
  select(p.data, Samples:Potential),
  p.data %>% select(pC1:pC474) %>% log()
)

Hierarchical clustering of the negative mode data (samples and compounds):

n.data.log %>%
  select(nC1:nC946) %>%
  as.matrix() %>%
  `rownames<-`(n.data.log$Samples) %>%
  scale(center = TRUE, scale = TRUE) %>%
  heatmap(scale = "none")

Hierarchical clustering of the postive mode data (samples and compounds):

p.data.log %>%
  select(pC1:pC474) %>%
  as.matrix() %>%
  `rownames<-`(p.data.log$Samples) %>%
  scale(center = TRUE, scale = TRUE) %>%
  heatmap(scale = "none")

I normally use mixOmics::cim() to create heatmaps, but it’s not agreeing with me. Trying out heatmap for now, will fix the terrible color scheme later.

PCA Analysis

n.pca <- prcomp(n.data.log[, 3:ncol(n.data.log)], center = TRUE, scale = FALSE)
p.pca <- prcomp(p.data.log[, 3:ncol(p.data.log)], center = TRUE, scale = FALSE)

Proportion of variance explained by each principal component for both sets of data:

n.prop.var <- data.frame(
  cbind(
    1:length(n.pca$sdev),
    n.pca$sdev^2 / sum(n.pca$sdev^2),
    cumsum(n.pca$sdev^2 / sum(n.pca$sdev^2))
    )
  )
colnames(n.prop.var) <- c("PC", "Proportion", "Cumulative")

p.prop.var <- data.frame(
  cbind(
    1:length(p.pca$sdev), 
    p.pca$sdev^2 / sum(p.pca$sdev^2),
    cumsum(p.pca$sdev^2 / sum(p.pca$sdev^2))
    )
  )
colnames(p.prop.var) <- c("PC", "Proportion", "Cumulative")

n.prop.var %>%
  ggplot(aes(x = PC, y = Proportion)) + 
  geom_point(size = 3) +
  geom_line(size = 1.5) +
  xlab("Principal Component") +
  ylab("Proportion of Variance Explained") +
  ggtitle("Proportion of Variance Explained by Each Component\nNegative Mode")

n.prop.var %>%
  ggplot(aes(x = PC, y = Cumulative)) + 
  geom_point(size = 3) +
  geom_line(size = 1.5) +
  xlab("Principal Component") +
  ylab("Cumulative Proportion of Variance Explained") +
  ggtitle("Cumulative Proportion of Variance Explained by Each Additional Component\nNegative Mode")

p.prop.var %>% 
  ggplot(aes(x = PC, y = Proportion)) + 
  geom_point(size = 3) +
  geom_line(size = 1.5) +
  xlab("Principal Component") +
  ylab("Proportion of Variance Explained") +
  ggtitle("Proportion of Variance Explained by Each Component\nPositive Mode")

p.prop.var %>% 
  ggplot(aes(x = PC, y = Cumulative)) + 
  geom_point(size = 3) +
  geom_line(size = 1.5) +
  xlab("Principal Component") +
  ylab("Cumulative Proportion of Variance Explained") +
  ggtitle("Cumulative Proportion of Variance Explained by Each Additional Component\nPositive Mode")

For both datasets, the first principal component explains about 77% of the variance. There’s a rapid drop-off: the second component in both cases explains about 13% of variance, and it’s only downhill from there. In both cases, the first 4 PCs are most likely to be useful.

Which PCs are best for describing the variance that’s associated with oxidative stress in these samples?

n.components <- n.data %>%
  select(Samples:Potential) %>%
  bind_cols(data.frame(n.pca$x))
  
n.components %>%
  ggplot(aes(x = PC1, y = PC2, group = Potential, color = Potential)) +
  geom_point(size = 3.5) +
  ggtitle("PC1 vs PC2 \n Negative Mode") +
  xlab("PC1 (77.3% var)") +
  ylab("PC2 (12.9% var)") +
  ylim(-7.0, 7.0)

n.components %>%
  ggplot(aes(x = PC2, y = PC3, color = Potential)) +
  geom_point(size = 3.5) +
  ggtitle("PC2 vs PC3 \n Negative Mode") +
  xlab("PC2 (12.9% var)") +
  ylab("PC3 (2.96%)") +
  xlim(-7.0, 7.0) +
  ylim(-3.5, 3.5)

n.components %>%
  ggplot(aes(x = PC3, y = PC4, color = Potential)) +
  geom_point(size = 3.5) +
  ggtitle("PC3 vs PC4 \n Negative Mode")  +
  xlab("PC3 (2.96%)") +
  ylab("PC4 (1.98%)") +
  xlim(-3.5, 3.5) +
  ylim(-3.0, 3.0)

It’s clear from the first plot that using PC1 to describe the data leads to the best separation of the groups. Also, it looks like the samples from the different electrical potentials are ordered nicely already, but that can be checked more rigorously with a linear model:

n.comp.num <- as.data.frame(
  cbind(
    as.numeric(as.character(n.components$Potential)), 
    n.components$PC1
    )
  )
colnames(n.comp.num) <- c("Potential", "PC1")
row.names(n.comp.num) <- n.components$Samples

n.PC1.model <- lm(PC1 ~ Potential, data = n.comp.num)
summary(n.PC1.model)
## 
## Call:
## lm(formula = PC1 ~ Potential, data = n.comp.num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9643 -1.1519  0.9274  1.2233  2.3228 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.240e+01  6.094e-01  -20.34 3.35e-16 ***
## Potential    2.479e-02  9.952e-04   24.91  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.759 on 23 degrees of freedom
## Multiple R-squared:  0.9643, Adjusted R-squared:  0.9627 
## F-statistic: 620.5 on 1 and 23 DF,  p-value: < 2.2e-16
n.comp.num %>%
  ggplot(aes(x = Potential, y = PC1, color = Potential)) +
  geom_point(size = 3.5) +
  geom_smooth(method = "lm", color = "black") +
  ggtitle("PC1 vs Potential (mV)\nNegative Mode") +
  xlab("Potential (mV)")

There is a clear strong positive relationship between PC1 and the electrical potential, although it doesn’t look to be perfectly linear.

p.components <- p.data %>%
  select(Samples:Potential) %>%
  bind_cols(data.frame(p.pca$x))
  
p.components %>%
  ggplot(aes(x = PC1, y = PC2, color = Potential)) +
  geom_point(size = 3.5) +
  ggtitle("PC1 vs PC2 \n Positive Mode") +
  xlab("PC1 (77.1%)") +
  ylab("PC2 (12.7%)") +
  ylim(-4, 4) +
  xlim(-12.5, 12.5)

p.components %>%
  ggplot(aes(x = PC2, y = PC3, color = Potential)) +
  geom_point(size = 3.5) +
  ggtitle("PC2 vs PC3 \n Positive Mode") +
  xlab("PC2 (12.7%)") +
  ylab("PC3 (2.93%)") +
  xlim(-4.5, 4.5) +
  ylim(-3, 3)

p.components %>%
  ggplot(aes(x = PC3, y = PC4, color = Potential)) +
  geom_point(size = 3.5) +
  ggtitle("PC3 vs PC4 \n Positive Mode") +
  xlab("PC3 (2.93%)") +
  ylab("PC4 (1.52%)") +
  xlim(-3, 3) +
  ylim(-3, 3)

PC1 also has the best relationship with electrochemical oxidative potential for the positive mode data. However, the spread of the data points within the groups, especially in the 0mV group, hints that the data quality is questionable.

p.comp.num <- as.data.frame(
  cbind(
    as.numeric(as.character(p.components$Potential)), 
    p.components$PC1
    )
  )
colnames(p.comp.num) <- c("Potential", "PC1")
row.names(p.comp.num) <- p.components$Samples

p.PC1.model <- lm(PC1 ~ Potential, data = p.comp.num)
summary(p.PC1.model)
## 
## Call:
## lm(formula = PC1 ~ Potential, data = p.comp.num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2528 -0.8082  0.0183  1.0149  2.8852 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.1903598  0.5447709  -15.04 2.18e-13 ***
## Potential    0.0163807  0.0008896   18.41 2.91e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.573 on 23 degrees of freedom
## Multiple R-squared:  0.9365, Adjusted R-squared:  0.9337 
## F-statistic: 339.1 on 1 and 23 DF,  p-value: 2.907e-15
p.comp.num %>%
  ggplot(aes(x = Potential, y = PC1, color = Potential)) +
  geom_point(size = 3.5) +
  geom_smooth(method = "lm", color = "black") +
  ggtitle("PC1 vs Potential (mV)\nPositive Mode") +
  xlab("Potential (mV)")

Note that the separation that makes the most sense is along PC1 (0mV to 1000mV spread out along the PC1 axis). PC2 may be interesting, but confusing, as there is some separation between groups, but 0mV and 1000mV are nearly at the same position, 250mV and 750mV are very similar, and 500mV is different from all other groups.PC3 may inform how the 0mV and the 250mV groups differ from the other three groups (and from each other).

n.PC1.load <- tbl_df(n.pca$rotation) %>%
  mutate(Compounds = row.names(data.frame(n.pca$rotation))) %>%
  select(Compounds, PC1) %>%
  arrange(PC1)

n.PC1.load %>%  
  ggplot(aes(sample = PC1)) + 
  stat_qq() +
  ggtitle("QQ Plot\nNegative Mode PC1") +
  ylim(-0.4, 0.4)

#take the bottom 2.5% of loadings and the top 2.5% of loadings
n.select.PC1 <- subset(
  n.PC1.load, 
  PC1 >= quantile(n.PC1.load$PC1, 0.975) | 
    PC1 <= quantile(n.PC1.load$PC1, 0.025)
  )

n.select.PC1
## # A tibble: 48 x 2
##    Compounds        PC1
##        <chr>      <dbl>
##  1     nC642 -0.3100661
##  2     nC552 -0.2875755
##  3     nC170 -0.2446238
##  4     nC553 -0.2238861
##  5     nC643 -0.2195593
##  6     nC705 -0.2086450
##  7     nC706 -0.1967110
##  8     nC346 -0.1869810
##  9     nC424 -0.1767920
## 10     nC418 -0.1476915
## # ... with 38 more rows
# These compounds are much more manageable to plot in a heatmap figure

n.data.log %>%
  select(one_of(n.select.PC1$Compounds)) %>%
  as.matrix() %>%
  scale(center = TRUE, scale = TRUE) %>%
  `rownames<-`(n.data.log$Samples) %>%
  heatmaply(scale = "none")

Looks like there’s a good selection of compounds picked up with this analysis with 2 major groups of compounds, with 2 sub-families in each:

  • A group that’s high in the 0mV samples that decrease in abundance as the oxidizing potential is increased (oxidation-sensitive molecules or precursors).
  • A group of compounds that are low in the 0mV, but are abundant in the oxidized samples. These would be the oxidation products.

Interesting compounds based on pos mode PC1.

p.PC1.load <- tbl_df(p.pca$rotation) %>%
  mutate(Compounds = row.names(data.frame(p.pca$rotation))) %>%
  select(Compounds, PC1) %>%
  arrange(PC1)
  
p.PC1.load %>% 
  ggplot( aes(sample = PC1)) + 
  stat_qq() +
  ggtitle("QQ Plot\nPositive Mode PC1") +
  ylim(-0.3, 0.3)

p.select.PC1 <- subset(
  p.PC1.load,
  PC1 >= quantile(p.PC1.load$PC1, 0.975) |
    PC1 <= quantile(p.PC1.load$PC1, 0.025)
  )

p.select.PC1
## # A tibble: 24 x 2
##    Compounds        PC1
##        <chr>      <dbl>
##  1     pC175 -0.2817509
##  2     pC210 -0.2788191
##  3     pC195 -0.1936996
##  4      pC98 -0.1883896
##  5     pC113 -0.1708991
##  6     pC105 -0.1354455
##  7     pC168 -0.1293160
##  8      pC69 -0.1229430
##  9      pC90 -0.1132040
## 10     pC117 -0.1084555
## # ... with 14 more rows
p.data.log %>%
  select(one_of(p.select.PC1$Compounds)) %>%
  as.matrix() %>%
  scale(center = TRUE, scale = TRUE) %>%
  `rownames<-`(p.data.log$Samples) %>%
  heatmaply(scale = "none")

The positive mode data. as hinted by the PCA analysis, is more variable than the negative mode analysis and the clustered data isn’t as neat. It looks like there’s some compounds that are oxidation-senstive that have been picked up by the analysis, but no oxidation products.

What to do after this point?

  1. Find which compounds overlap between pos and neg mode? Eliminate repeats.
  2. Need a better feature selection procedure - PCA isn’t the best for this purpose and it didn’t work so well on the positive mode data.
  3. Need to create a “test” dataset to test some models on, or at the very least create a model with cross-validation.